Multi-Level Hashing Dedup in HPC Storage Systems

Authors

  • Eric Valenzuela
  • Yong Chen
Abstract

Reaching high data deduplication ratios in High Performance Computing (HPC) is highly achievable. Prior art demonstrates the magnitude of reduction possible: on average, 15 to 30 percent of redundant data can be removed using deduplication techniques. The objective of this research study is to design and evaluate a dedup system that provides 100% data integrity, with no possibility of losing data, while reducing the need for costly byte-by-byte comparisons. Because data deduplication relies on hashing algorithms, hash collisions will occur. Prior systems omit the byte-by-byte comparisons needed to handle collisions, citing the low probability of a collision. Our research focuses on investigating a multi-level dedup method that reduces byte-by-byte comparisons while providing 100% data integrity, and on implementing multi-level hash functions that take advantage of the Xeon Phi many-core architecture to compute cryptographic fingerprints concurrently. Our current proof-of-concept evaluations with a deduplication file system, Lessfs, show promising results.

Keywords: Deduplication; File Systems; Multi-Level Hashing; Performance; Storage
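The central idea of the abstract can be illustrated with a minimal sketch (hypothetical; this is not the paper's actual Lessfs or Xeon Phi implementation). A cheap first-level hash screens out most non-duplicates, a cryptographic second-level fingerprint narrows the candidates, and a byte-by-byte comparison runs only when both levels agree, so a hash collision can never corrupt data:

```python
import hashlib
import zlib

# Hypothetical two-level index: weak hash -> strong hash -> stored chunks.
# Keeping a list per strong hash means even a SHA-256 collision cannot
# cause data loss; correctness never rests on the fingerprints alone.
index: dict[int, dict[bytes, list[bytes]]] = {}

def deduplicate(chunk: bytes) -> bool:
    """Return True if chunk is a verified duplicate, else store it as new."""
    weak = zlib.crc32(chunk)                  # level 1: cheap screening hash
    bucket = index.setdefault(weak, {})
    strong = hashlib.sha256(chunk).digest()   # level 2: cryptographic fingerprint
    candidates = bucket.setdefault(strong, [])
    for stored in candidates:                 # level 3: byte-by-byte comparison,
        if stored == chunk:                   # reached only on a full two-level match
            return True                       # duplicate confirmed with certainty
    candidates.append(chunk)                  # unique chunk (or a true collision)
    return False
```

In a sketch like this, chunks that are not duplicates usually already differ at the weak-hash level, so the expensive comparison is rarely reached; the abstract's approach additionally parallelizes the fingerprint computation across many cores.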


Related resources

Deduplicating Compressed Contents in Cloud Storage Environment

Data compression and deduplication are two common approaches to increasing storage efficiency in the cloud environment. Both users and cloud service providers have economic incentives to compress their data before storing it in the cloud. However, our analysis indicates that compressed packages of different data and differently compressed packages of the same data are usually fundamentally diff...
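The effect described above is easy to reproduce (a hypothetical illustration using Python's standard library, not the paper's method): compressing the same payload with different settings yields byte-wise different packages, so a fingerprint-based deduplicator sees no duplicates at all:

```python
import gzip
import hashlib

payload = b"identical data " * 1024

# Same data, two compression levels: the resulting packages differ
# byte-for-byte, so chunk fingerprints no longer match.
pkg_fast = gzip.compress(payload, compresslevel=1)
pkg_best = gzip.compress(payload, compresslevel=9)

print(hashlib.sha256(pkg_fast).hexdigest())
print(hashlib.sha256(pkg_best).hexdigest())
print(pkg_fast == pkg_best)  # False: identical input, distinct compressed output
```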


Toward Transparent Data Management in Multi-layer Storage Hierarchy of HPC Systems

Upcoming exascale high performance computing (HPC) systems are expected to comprise a multi-tier storage hierarchy, and will thus necessitate innovative storage and I/O mechanisms. Traditional disk- and block-based interfaces and file systems face severe challenges in utilizing the capabilities of storage hierarchies due to the lack of hierarchy support and semantic interfaces. Object-based and semant...


Challenges and Opportunities of User-Level File Systems for HPC (Dagstuhl Seminar 17202)

The performance gap between magnetic disks and data processing on HPC systems has become so large that efficient data processing can only be achieved by introducing non-volatile memory (NVRAM) as a new storage tier. Although the benefits of hierarchical storage have been adequately demonstrated, to the point that the newest leadership-class HPC systems will employ burst buffers, critical que...


Object Storage: Scalable Bandwidth for HPC Clusters

This paper describes the Object Storage Architecture solution for cost-effective, high-bandwidth storage in High Performance Computing (HPC) environments. An HPC environment requires a storage system that scales to very large sizes and performance levels without sacrificing cost-effectiveness or ease of sharing and managing data. Traditional storage solutions, including disk-per-node, Storage-Area Netwo...


Novel HPC Technologies for Scalable CAE: The Case for Parallel I/O and File Systems

As HPC continues its aggressive platform migration from proprietary supercomputers and Unix servers to HPC clusters, expectations grow for clusters to meet the I/O demands of increasing fidelity in CAE modeling and of data management in the CAE workflow. Cluster deployments have increased as organizations seek ways to cost-effectively grow compute resources for CAE applications, and during this mig...




Publication date: 2014